Topics

  • Background
  • Univariate Data
  • Bivariate Data
  • Categorical Data
  • Text & Customization

Learning Objectives

For this lecture, the learning objectives include:

  • Create univariate and bivariate plots of data (continuous-continuous & continuous-categorical).

  • Apply varying basic symbologies for representing data in plots.

  • Use named and hex colors to better distinguish groups of data in plots.

The Example Data - Iris

There is a classic data set in statistics called Fisher’s Iris Data Set, which records sepal and petal lengths and widths for 50 flowers from each of three species of Iris.

Anatomy of a Flower

Iris morphology

The Example Data - Iris

summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

A Single Vector of Data

sepal_length <- iris$Sepal.Length
head(sepal_length)
[1] 5.1 4.9 4.7 4.6 5.0 5.4

A Single Vector of Data - Histograms

hist( sepal_length )

Arguments to Customize Plots

  • xlab & ylab: The names attached to both x- and y-axes.

  • main: The title on top of the graph.

  • breaks: This controls the way in which the original data are partitioned (e.g., the width of the bars along the x-axis).

    • If you pass a single number, n, to this option, the data will be partitioned into approximately n bins (R treats the number as a suggestion and picks “pretty” breakpoints).
    • If you pass a sequence of values to this, it will use this sequence as the boundaries of bins.
  • col: The color of the bar (not the border)

  • probability: A flag, either TRUE or FALSE (the default), that scales the y-axis as a probability density for each bin rather than a count of the number of elements in that range.
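The arguments above can be combined in a single call. The following is a minimal sketch, assuming bin edges every 0.5 units and the named color "steelblue"; both are illustrative choices, not requirements.

```r
sepal_length <- iris$Sepal.Length

# Explicit bin boundaries every 0.5 units (an illustrative choice);
# the range 4 to 8 covers every observed value.
bin_edges <- seq(4, 8, by = 0.5)

h <- hist(sepal_length,
          breaks = bin_edges,
          probability = TRUE,     # y-axis shows density instead of counts
          col = "steelblue",      # fill color of the bars
          xlab = "Sepal Length",
          ylab = "Density",
          main = "Distribution of Sepal Length")

# With probability = TRUE the areas of the bars sum to 1.
sum(h$density * diff(h$breaks))
```

Because hist() returns the histogram object invisibly, saving it as h lets you inspect the breaks, counts, and densities that were actually used.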

Density Plots

d_sepal.length <- density( sepal_length )
d_sepal.length

Call:
    density.default(x = sepal_length)

Data: sepal_length (150 obs.);  Bandwidth 'bw' = 0.2736

       x               y            
 Min.   :3.479   Min.   :0.0001495  
 1st Qu.:4.790   1st Qu.:0.0341599  
 Median :6.100   Median :0.1534105  
 Mean   :6.100   Mean   :0.1905934  
 3rd Qu.:7.410   3rd Qu.:0.3792237  
 Max.   :8.721   Max.   :0.3968365  
class(d_sepal.length)
[1] "density"

Density Plots

plot( d_sepal.length )
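A density curve can also be drawn on top of a histogram, as long as the histogram is on the density scale. This is a small sketch of that idea; the colors and line width are arbitrary choices.

```r
sepal_length <- iris$Sepal.Length
d <- density(sepal_length)

# The histogram must use probability = TRUE so both share the same y-scale.
hist(sepal_length, probability = TRUE,
     col = "grey90", border = "white",
     xlab = "Sepal Length", main = "Sepal Length")

# lines() adds to the existing plot rather than starting a new one.
lines(d, lwd = 2, col = "darkred")

# Rough check: the area under the estimated density is close to 1.
area <- sum(d$y) * diff(d$x[1:2])
area
```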

The Generality of plot()

In R, many objects understand how to plot themselves.

  • Density objects

  • Analyses (regression, ANOVA, etc.)

  • points, lines, polygons, & rasters
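This works because plot() looks at the class of the object it is given and hands it to a matching method. A short sketch, using a regression on the iris data purely as an illustration:

```r
d   <- density(iris$Sepal.Length)
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)

class(d)     # "density"
class(fit)   # "lm"

# The same function call draws very different figures,
# depending on the class of its first argument.
plot(d)               # a density curve
plot(fit, which = 1)  # a regression diagnostic (residuals vs. fitted)
```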

A Scatter Plot

plot( iris$Sepal.Length, iris$Sepal.Width )

Functional Forms

When plotting two vectors of data, the first argument goes on the x-axis and the second goes on the y-axis.

 

plot( x, y )

The same plot can also be written as a formula, where y is a function of x.

 

plot( y ~ x)

 

This is consistent with how we will specify analyses (regression, ANOVA, etc.).
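As a concrete sketch with the iris data, the two calls below draw the same points. Note that the order of the variables flips between the two forms.

```r
# Vector form: x first, then y.
plot(iris$Sepal.Length, iris$Sepal.Width)

# Formula form: y on the left of the tilde, x on the right.
# The data argument tells R where to find the columns.
plot(Sepal.Width ~ Sepal.Length, data = iris)

# A formula is an object in its own right and can be stored and reused.
f <- Sepal.Width ~ Sepal.Length
all.vars(f)   # response first, then predictor
```

The formula form also labels the axes with the bare column names rather than iris$Sepal.Length and iris$Sepal.Width.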

plot() Options

Parameter   Description
type        The kind of plot to show: "p" (points), "l" (lines), "b" (both), or "o" (both, overplotted). A point plot is the default.
pch         The character (or symbol) used for plotting. There are 26 recognized general symbols (pch = 0 through 25). The default is pch=1.
col         The color of the symbols/lines that are plotted.
cex         The magnification of the plotted character. The default is cex=1; other values increase (cex > 1) or decrease (0 < cex < 1) the scaling of the symbols. The related arguments cex.lab and cex.axis scale the axis labels and tick labels in the same way.
lwd         The width of any lines in the plot.
lty         The type of line to be plotted (solid, dashed, etc.).
bty         The “box” type around the plot ("o", "l", "7", "c", "u", "]", and my favorite, "n").
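A quick way to get a feel for these options is to try them on made-up data. The sketch below first draws all 26 pch symbols, then a small line-and-point plot; the x and y values are invented for illustration only.

```r
# Show every general plotting symbol, pch = 0 through 25.
pch_values <- 0:25
plot(pch_values, rep(1, length(pch_values)),
     pch = pch_values, cex = 1.5,
     bty = "n", yaxt = "n",
     xlab = "pch value", ylab = "")

# Made-up data to demonstrate type, lty, lwd, col, and bty together.
x <- 1:10
y <- x^2
plot(x, y,
     type = "b",          # both points and lines
     pch = 16,            # filled circle
     col = "darkorange",
     cex = 1.2,
     lwd = 2,             # thicker line
     lty = 2,             # dashed line
     bty = "l")           # L-shaped box: left and bottom axes only
```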

 

Species Differences in the iris dataset

summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Symbology

symbol <- as.numeric( iris$Species)
symbol
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3

Species Differences by Symbol

plot( iris$Sepal.Length, iris$Sepal.Width, pch=symbol )
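The conversion works because a factor stores each level as an integer code, in the order given by levels(). If you would rather pick the symbols yourself, one option is a small lookup vector; the particular pch values below (16, 17, 15) are just an illustrative choice.

```r
levels(iris$Species)                 # the order determines the codes 1, 2, 3
symbol <- as.numeric(iris$Species)
table(iris$Species, symbol)          # each species maps to exactly one code

# Hypothetical lookup: one hand-picked symbol per species.
species_pch <- c(setosa = 16, versicolor = 17, virginica = 15)
custom_symbol <- unname(species_pch[as.character(iris$Species)])

plot(iris$Sepal.Length, iris$Sepal.Width, pch = custom_symbol)
```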

Additional Customizations

plot( Sepal.Width ~ Sepal.Length, data = iris, 
      pch = symbol, bty="n", cex=1.5, cex.axis=1.5, cex.lab = 1.5, 
      xlab="Sepal Length", ylab="Sepal Width")

Named Colors

In R, there are 657 different named colors accessible through the function colors().

sample( colors(), size=5, replace = FALSE )
[1] "orchid"         "orchid4"        "grey90"         "green2"        
[5] "darkslategray1"

 

raw_colors <- sample( colors(), size=3, replace=FALSE)
colors <- raw_colors[ symbol ]
colors
  [1] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
  [6] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [11] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [16] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [21] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [26] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [31] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [36] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [41] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [46] "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4"  "lightcyan4" 
 [51] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [56] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [61] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [66] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [71] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [76] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [81] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [86] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [91] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
 [96] "navajowhite" "navajowhite" "navajowhite" "navajowhite" "navajowhite"
[101] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[106] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[111] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[116] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[121] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[126] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[131] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[136] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[141] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
[146] "grey4"       "grey4"       "grey4"       "grey4"       "grey4"      
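Since sample() is random, you will get different colors every time you run the code above. A minimal sketch of making the draw repeatable with set.seed(); the seed value 42 is arbitrary, and the name point_colors is used here only to avoid reusing the name of the colors() function.

```r
length(colors())     # number of named colors available

set.seed(42)         # arbitrary seed, fixes the random draw
raw_colors <- sample(colors(), size = 3, replace = FALSE)

set.seed(42)         # same seed, same three colors
again <- sample(colors(), size = 3, replace = FALSE)
identical(raw_colors, again)

# Expand the three colors to one color per observation.
point_colors <- raw_colors[as.numeric(iris$Species)]
table(point_colors)
```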

Adding a Legend

plot( Sepal.Width ~ Sepal.Length, data = iris,
      col = colors, pch=20, bty="n", cex=2,
      xlab="Sepal Length", ylab="Sepal Width")
legend(6.5,4.3, pch=20, cex=1.5, col=raw_colors,legend=levels(iris$Species) )

Hex Colors

Colors are defined by three channels:

  • Red
  • Green
  • Blue

In base-16 no less:

0 1 2 3 4 5 6 7 8 9 A B C D E F

So with two digits, that is 16 × 16 = 256 distinct values for each color channel:

00 → FF

Hex Colors

Colors are represented as RRGGBB triplets preceded by a hash sign (#).

raw_colors <- c("#86cb92", "#8e4162", "#260F26")
colors <- raw_colors[ symbol ]
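Base R can translate between hex strings and decimal channel values, which is a handy way to check your understanding of the encoding. A short sketch using the first two colors above:

```r
# Two hex digits cover 0 to 255.
strtoi("00", base = 16L)   # 0
strtoi("FF", base = 16L)   # 255

# "#86cb92" split into its three channels.
strtoi(c("86", "cb", "92"), base = 16L)   # 134 203 146

# rgb() goes from channel values to a hex string ...
rgb(134, 203, 146, maxColorValue = 255)   # "#86CB92"

# ... and col2rgb() goes from a hex string (or a named color) back.
col2rgb("#8e4162")
```

Hex digits are not case sensitive, so "#86cb92" and "#86CB92" are the same color.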

 

Color Theme Generators

Search for something like “Color Theme Generator” and see what you find.

  • I use coolors to explore various themes.

  • There is an R package for Color Brewer, RColorBrewer, that makes these palettes easy to integrate (we will use this in ggplot graphics).

Color Brewer in R

 

library(RColorBrewer)
display.brewer.all()
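Once you have found a palette you like, brewer.pal() returns its colors as hex strings. This sketch assumes the RColorBrewer package is installed (the code is skipped otherwise) and uses the "Set2" palette purely as an example.

```r
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  # Three colors from the qualitative palette "Set2" (an example choice).
  species_palette <- RColorBrewer::brewer.pal(n = 3, name = "Set2")

  # One color per observation, indexed by species code.
  point_colors <- species_palette[as.numeric(iris$Species)]

  plot(Sepal.Width ~ Sepal.Length, data = iris,
       col = point_colors, pch = 20, bty = "n", cex = 2,
       xlab = "Sepal Length", ylab = "Sepal Width")
  legend("topright", legend = levels(iris$Species),
         col = species_palette, pch = 20, bty = "n")
}
```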

Wes Anderson Palettes

This is a fun package that takes color palettes from Wes Anderson films for plotting.

 

library( wesanderson )
names( wes_palettes )
 [1] "BottleRocket1"     "BottleRocket2"     "Rushmore1"        
 [4] "Rushmore"          "Royal1"            "Royal2"           
 [7] "Zissou1"           "Zissou1Continuous" "Darjeeling1"      
[10] "Darjeeling2"       "Chevalier1"        "FantasticFox1"    
[13] "Moonrise1"         "Moonrise2"         "Moonrise3"        
[16] "Cavalcanti1"       "GrandBudapest1"    "GrandBudapest2"   
[19] "IsleofDogs1"       "IsleofDogs2"       "FrenchDispatch"   
[22] "AsteroidCity1"     "AsteroidCity2"     "AsteroidCity3"    

 

wes_palettes$Zissou1
[1] "#3B9AB2" "#78B7C5" "#EBCC2A" "#E1AF00" "#F21A00"
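A palette is just a vector of hex strings, so it plugs into plot() the same way as any other color vector. The sketch below copies the Zissou1 values printed above so it runs without the package; picking the first, third, and fifth colors is an arbitrary choice to keep the species visually distinct.

```r
# Values copied from the wes_palettes$Zissou1 output above.
zissou <- c("#3B9AB2", "#78B7C5", "#EBCC2A", "#E1AF00", "#F21A00")

# Three well-separated colors, one per species (arbitrary selection).
species_cols <- zissou[c(1, 3, 5)]
point_cols   <- species_cols[as.numeric(iris$Species)]

plot(Sepal.Width ~ Sepal.Length, data = iris,
     col = point_cols, pch = 20, bty = "n", cex = 2,
     xlab = "Sepal Length", ylab = "Sepal Width")

# A keyword such as "topright" positions the legend without coordinates.
legend("topright", legend = levels(iris$Species),
       col = species_cols, pch = 20, bty = "n")
```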

Categorical Data

Mean Sepal Length, by Species

mu.Setosa <- mean( iris$Sepal.Length[ iris$Species == "setosa" ])
mu.Versicolor <- mean( iris$Sepal.Length[ iris$Species == "versicolor" ])
mu.Virginica <- mean( iris$Sepal.Length[ iris$Species == "virginica" ])

meanSepalLength <- c( mu.Setosa, mu.Versicolor, mu.Virginica )
meanSepalLength
[1] 5.006 5.936 6.588

The BarPlot

Plotting quantitative data as a magnitude or amount.

meanSepalLength <- by( iris$Sepal.Length, iris$Species, mean )
meanSepalLength
iris$Species: setosa
[1] 5.006
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.936
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.588

 

barplot( meanSepalLength, 
         xlab = "Iris Species",
         ylab = "Average Sepal Length")
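An alternative to by() is tapply(), which returns the group means with the species names attached. The sketch below also colors the bars and prints each mean above its bar; the hex colors are reused from the Hex Colors slide and the y-axis limit is an arbitrary choice to leave room for the labels.

```r
# Mean sepal length for each species, with names attached.
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species

# barplot() returns the x-position of the middle of each bar.
bar_mid <- barplot(mean_by_species,
                   col = c("#86cb92", "#8e4162", "#260F26"),
                   ylim = c(0, 8),
                   xlab = "Iris Species",
                   ylab = "Average Sepal Length")

# Write each mean just above its bar (pos = 3 means "above").
text(bar_mid, mean_by_species, labels = format(mean_by_species), pos = 3)
```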

Boxplot - Information Dense

A boxplot packs a lot of information into a small space and is appropriate when the groupings on the x-axis are categorical. For each category, the graphical representation includes:

  • The median value for the raw data.

  • A box indicating the area between the first and third quartiles (i.e., the values enclosing the middle 50% of the data, from the 25th to the 75th percentile). The top and bottom are often referred to as the hinges of the box.

  • A notch (if requested) representing the confidence interval around the estimate of the median.

  • Whiskers extending to the most extreme data points that lie within \(1.5 \times IQR\) of the box (IQR is the Interquartile Range).

  • Any data points that fall beyond the whiskers, which are plotted as individual points.

 

boxplot( Sepal.Length ~ Species, data=iris, 
         notch=TRUE, ylab="Sepal Length")
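The numbers behind the figure can be inspected by asking boxplot() not to draw anything. This is a sketch; the row labels are added here for readability and are not part of what boxplot() returns.

```r
# plot = FALSE returns the summary statistics without drawing.
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Label the five rows of the stats matrix (labels added for clarity).
rownames(b$stats) <- c("lower whisker", "lower hinge", "median",
                       "upper hinge", "upper whisker")
colnames(b$stats) <- b$names
b$stats

# Points beyond the whiskers, and the group each one belongs to.
b$out
b$names[b$group]
```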

Textifying Your Plot

cor <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
cor

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor 
-0.1175698 
cor.text <- paste( "r = ", format( cor$estimate, digits=4), 
                   "; P = ", format( cor$p.value, digits=4 ), 
                   sep="" ) 
cor.text
[1] "r = -0.1176; P = 0.1519"

 

plot( Sepal.Width ~ Sepal.Length, data = iris, 
      col=colors, 
      pch=20, 
      bty="n", 
      xlab="Sepal Length", ylab="Sepal Width")
text( 6.5, 4.2, cor.text, cex=1.2 )
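Two optional variations on the same idea: sprintf() builds the label with a fixed number of decimal places, and legend() with a position keyword places the text without hard-coded coordinates. This is a sketch; the name ct is used only to avoid reusing the name of the cor() function.

```r
ct <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

# %.4f prints each number with four decimal places.
label <- sprintf("r = %.4f; P = %.4f", unname(ct$estimate), ct$p.value)
label   # "r = -0.1176; P = 0.1519"

plot(Sepal.Width ~ Sepal.Length, data = iris,
     pch = 20, bty = "n",
     xlab = "Sepal Length", ylab = "Sepal Width")

# A legend with no symbols and no box acts as a text annotation.
legend("topright", legend = label, bty = "n", cex = 1.2)
```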

Questions

If you have any questions, please feel free to post to the Canvas discussion board for the class, or drop me an email.
